(Design) Formatted Parts #463

Merged: 13 commits, Dec 4, 2023

Collaborator

@eemeli eemeli commented Aug 29, 2023

This is pre-empting #458 and #461, but I figure they'll likely land before this.

The proposed design is close to that used by the JS polyfill, but there are some differences.

@eemeli added the design, Agenda+, and formatting labels Aug 29, 2023
Member

@aphillips aphillips left a comment

Good to see this. Many initial comments...

exploration/0003-formatted-parts.md (outdated)
Past examples have shown us that if we don't provide a formatter to parts,
the string output will be re-parsed and re-processed by users.

## Use-Cases
Member

These need more flesh.

Collaborator Author

@aphillips This section has been expanded since this comment. Sufficiently?

exploration/0003-formatted-parts.md (outdated)
Comment on lines 75 to 76
Each part should have at most one of `value` or `parts` defined;
some may have none.
Member

Does this make sense? Or would it be better to say:

Suggested change
Each part should have at most one of `value` or `parts` defined;
some may have none.
Each part MUST have either a `value` or `parts` defined.
A part MAY have a `value` that is the empty string.
A part MAY have a `parts` that is an empty list.

Collaborator Author

The suggestion would make the proposed MessageFallbackPart invalid, as it does not include either value or parts. It's conceivable for other parts to also exist which do not include either, such as open/close expressions without an annotation.

Member

Would an empty value or empty parts satisfy that? Or a fallback could have a string expression? Empty strings don't result in the erroneous emission of the string null 😉

I understand that it would "break" the current definition: we should decide what the shapes should be and make consistent.

Collaborator Author

For fallback when formatting to a string the {...} make sense as a visual indicator, but for a formatted-parts consumer some different representation could be better. So including an explicit value or parts would be misleading.

For open/close, it doesn't make sense to define their explicit parts shapes in this spec, but for JS I have them as:

interface MessageMarkupPart {
  type: 'open' | 'close';
  source: string;
  name: string;
  value?: unknown;
  options: { [key: string]: unknown };
}

There, the value would be 'b' for {b +html}, but it would not be set for {+html.b}. Setting it to an empty string would be misleading, as {+foo} and {|| +foo} could easily mean different things.

Collaborator

In #463 (comment) I'm suggesting to define separate interfaces for single-valued and multi-valued parts. This could extend to fallback parts and markup, as well.

exploration/0003-formatted-parts.md (outdated)

When the resolution or formatting of a placeholder fails,
it is represented in the output by MessageFallbackPart.
No `value` is provided; when formatting to a string,
Member

Should we provide the fallback value? I think we have some text in the spec that allows implementations or functions to supply their own fallback.


Question: should a goal be that the string output of a message be equivalent to concatenating the string representation of its parts? Or at least that a test be that one can assemble the string output from the parts?

Collaborator Author

> Should we provide the fallback value? I think we have some text in the spec that allows implementations or functions to supply their own fallback.

Fallback customization is only available for syntax and data model errors, which default to . That would be used as the source value here.

> Question: should a goal be that the string output of a message be equivalent to concatenating the string representation of its parts? Or at least that a test be that one can assemble the string output from the parts?

I think the latter. With the current proposal, it's doable like this:

function stringifyParts(parts) {
  let res = ''
  for (let part of parts) {
    if (part.type === 'fallback') res += `{${part.source}}`
    else if ('value' in part) res += String(part.value)
    else if ('parts' in part) {
      for (let sub of part.parts) res += String(sub.value)
    }
  }
  return res
}

Collaborator

So the parts are technically not formatted yet?
I can't get from the proposal what a sub.value is.
Can it be a date?

exploration/0003-formatted-parts.md (outdated)
and in most cases it's presumed that the sub-part `value` would be a string.

```ts
interface MessageDateTimePart {
Member

Can we define the "parts" so that they are generic rather than each type having its own special part type?

Collaborator Author

Yes, that's the MessageExpressionPart definition above, which these interfaces also match. The definitions here are giving more specificity about what e.g. :datetime and :number end up producing, i.e. that they have explicit type identifiers and define parts rather than value.

interface MessageDateTimePart {
type: "datetime";
source: string;
parts: Iterable<{ type: string; value: unknown }>;
Member

Notice in my example earlier in this review that I exposed the field name as a field in the parts. This becomes important when trying to decorate an Iterable whose contents shift around due to the locale/localized formatting. Dates have this feature (YMD, DMY, MDY). So do currency values (which may or may not have a decimal part, may have the symbol first or last, and may or may not have a space around the symbol). That's how the screen shots of currency values (from amazon.com and amazon.fr) get decorated.

Collaborator Author

Yes, that's the "type" field here. I picked that rather than "name" because it's used by the JS Intl formatters.

Member

That was super non-obvious to me (hence this conversation!), particularly since the type fields in the examples seemed to be focused on the "type" of formatter (datetime, number) rather than on the parts field within them. Admittedly, the MF2-level parts will be at the placeholder level. Interior parts are the problem of the formatter. But this was not at all clear and probably could use an example.

Collaborator Author

Let's make sure that it's explained in more detail in the spec PR, should this design doc be accepted.

@eemeli
Collaborator Author

eemeli commented Oct 24, 2023

Just to note, progress on this design doc appears to be blocked by lack of review for the past two months.

@aphillips
Member

Pinging @mihnita: You have an action to review/comment on this doc.

@aphillips changed the title from "Add design doc for formatted parts" to "(Design) Formatted Parts" on Nov 12, 2023
@mihnita
Collaborator

mihnita commented Nov 12, 2023

My main concern (also expressed in Seville) is that the proposed model is conceptually a tree
(2 levels deep, but maybe more, with the deeper levels unspecified).

At the top level, the formatted-parts result is an iterable sequence of parts.
One of them is MessageExpressionPart, which contains parts?: Iterable<{ type: string; value: unknown }>.

So from what I understand the MessageExpressionPart contains what a formatter function returns.
But the proposal stops there, and does not go to explain what the parts returned by a formatter would be.

Let's take an example: I'll be out of office {$vacation, :dateRange}
The result might be something like "I'll be out of office November 27-December 7, 2023"

If I read this proposal correctly, the "November 27-December 7, 2023" would be described by a MessageExpressionPart, but there is no explanation of what that looks like.

So the most useful part for people's use cases ("how can I format the months differently?") is left out.


It also looks to me like the result can't be used as is, or serialized, as it is not yet "formatted"
The parts?: Iterable<{ type: string; value: unknown }> still contain unformatted(?) values (value: unknown).

I can't format-to-parts server side and send the result to a client, for example.


I looked at existing implementations of similar concepts in iOS / macOS, Android, ICU4C / ICU4J, Eclipse SWT.

And all of them use "flat" structures: a text, with information about ranges.


It is not a bad design in itself, standalone.

But my main concerns are:

  • it stops the description right where it starts being interesting
  • the "impedance" between this proposal and existing implementations (nesting vs flat text with attributes)
  • the result can't be used as is, or serialized, as it is not yet "formatted"

Do we want to go in that direction?

I will document the existing implementations in a separate comment, for readability.

@mihnita
Collaborator

mihnita commented Nov 12, 2023

macOS & iOS

AttributedString

var attributedString = AttributedString("This is a string with empty attributes.")
var container = AttributeContainer()
container[AttributeScopes.AppKitAttributes.ForegroundColorAttribute.self] = .red
attributedString.mergeAttributes(container, mergePolicy: .keepNew)

The attributes are in an AttributeContainer (key-value)

APIs to access:

@mihnita
Collaborator

mihnita commented Nov 12, 2023

Android

android.text.Spanned

API: to access the span info: T[] getSpans(int start, int end, Class<T> type)

Classes:

Spanned(i)
	Spannable (i)
		SpannableStringBuilder (c) => mutable content & markup
		SpannableString (c) => immutable content, mutable markup

Example:

Spannable spannable = new SpannableString("0123456789_ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz");
spannable.setSpan(new ForegroundColorSpan(Color.BLUE), 10, 20, Spannable.SPAN_EXCLUSIVE_EXCLUSIVE);

Spans supported: https://developer.android.com/reference/android/text/style/package-summary
(and can be extended with custom ones)

@mihnita
Collaborator

mihnita commented Nov 12, 2023

ICU4J

FormattedValue

Classes implementing it:
DateIntervalFormat.FormattedDateInterval, FormattedMessage, FormattedNumber, FormattedNumberRange, ListFormatter.FormattedList, PlainStringFormattedValue, RelativeDateTimeFormatter.FormattedRelativeDateTime

Two ways to iterate:

  • nextPosition(ConstrainedFieldPosition cfpos) // cfpos contains the info.
  • AttributedCharacterIterator toCharacterIterator() => inherited from JDK (java.text.AttributedCharacterIterator)

ICU4C

icu::FormattedValue

Implemented in icu::FormattedDateInterval, icu::number::FormattedNumber, icu::FormattedList, icu::number::FormattedNumberRange, and icu::FormattedRelativeDateTime.

Iteration: nextPosition (ConstrainedFieldPosition &cfpos, UErrorCode &status) const =0

There is no equivalent of toCharacterIterator()

@mihnita
Collaborator

mihnita commented Nov 12, 2023

SWT

StyledText

APIs: setStyleRange(StyleRange range), StyleRange[] getStyleRanges()

@aphillips
Member

aphillips commented Nov 12, 2023

@mihnita

> If I read this proposal correctly, the "November 27-December 7, 2023" would be described by a MessageExpressionPart, but there is no explanation of what that looks like.

It contains an Iterable of parts. The idea is that the formatting function (DateTimeRangeFormatter in this case) knows what parts there are and supplies that information. This way you can decorate the MessageExpressionPart as a thing and still decorate subparts (such as the month names). MF2 doesn't know or need to know about what kinds of parts (if any) a subordinate formatter has or returns.

This could be used to create an attributed string just as much as it could be used to create a sequence of DOM nodes. What is important is that the various boundaries are accessible to the caller and in the correct order for the localized message. It's important to know where the placeables were, but then know inside the placeable where sub-field boundaries are.

Conversely, an implementation of MF2 could return an attributed string describing the parts defined here (using attributes instead of objects)?

source: string;
parts?: Iterable<{ type: string; value: unknown }>;
value?: unknown;
dir?: "ltr" | "rtl" | "auto";
Collaborator

It is unclear to me why we have dir and locale at this level.

We don't need a locale to format anything, because the parts should be already formatted.
The whole proposal is called "Formatted Parts"

The locale might be needed to render things.

Or to process the formatting result (fix grammatical agreements as a post-step, fix "a apple" to "an apple" (en) or "La abeille" => "L'abeille" (fr), or to sentence case the result of "{item} is foo").

But then that is something that is needed for the whole collection of parts, not on MessageExpressionPart only.

Collaborator Author

These are needed to allow embedding content in a message that uses a different script or locale than the surrounding message.

Collaborator

Kind of clear what they can be used for.

But this info is on the MessageExpressionPart, which comes from an expression.
And an expression can't create this info out of nothing, it is probably something we passed as a parameter.
So if I already know the info (because I passed it to the expression), having it on the MessageExpressionPart is useless duplication.

Collaborator Author

Consider this message, intended for consumption by a text-to-speech system:

In French, the number {98 :number} is commonly expressed as {98 :number @locale=fr},
but in Belgium it's {98 :number @locale=fr-BE}.

How would you propose that the locale information is transmitted, if not as a field on the formatted parts?

[
  { type: 'text', value: 'In French, the number ' },
  { type: 'number', source: '|98|', parts: [{ type: 'integer', value: '98' }] },
  { type: 'text', value: ' is commonly expressed as ' },
  { type: 'number', source: '|98|', locale: 'fr', parts: [{ type: 'integer', value: '98' }] },
  { type: 'text', value: ", but in Belgium it's " },
  { type: 'number', source: '|98|', locale: 'fr-BE', parts: [{ type: 'integer', value: '98' }] },
  { type: 'text', value: '.' }
]

As context, the fr number would be "quatre-vingt-dix-huit", while in fr-BE it's "nonante-huit".
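
To make the point concrete, here is a minimal sketch that runs a typed adaptation of the `stringifyParts()` function suggested earlier in this thread over the parts listed above. The `ExamplePart` shape is a simplification assumed for this illustration (arrays instead of `Iterable`), not the definition from the design doc.

```typescript
// Hypothetical, simplified part shape for illustration only.
interface ExamplePart {
  type: string;
  source?: string;
  locale?: string;
  value?: unknown;
  parts?: Array<{ type: string; value: unknown }>;
}

// Adapted from the stringifyParts() sketch earlier in the conversation.
function stringifyParts(parts: ExamplePart[]): string {
  let res = '';
  for (const part of parts) {
    if (part.type === 'fallback') res += `{${part.source}}`;
    else if (part.value !== undefined) res += String(part.value);
    else if (part.parts !== undefined) {
      for (const sub of part.parts) res += String(sub.value);
    }
  }
  return res;
}

const exampleParts: ExamplePart[] = [
  { type: 'text', value: 'In French, the number ' },
  { type: 'number', source: '|98|', parts: [{ type: 'integer', value: '98' }] },
  { type: 'text', value: ' is commonly expressed as ' },
  { type: 'number', source: '|98|', locale: 'fr', parts: [{ type: 'integer', value: '98' }] },
  { type: 'text', value: ", but in Belgium it's " },
  { type: 'number', source: '|98|', locale: 'fr-BE', parts: [{ type: 'integer', value: '98' }] },
  { type: 'text', value: '.' },
];

// The flat string shows "98" three times and carries no locale information.
const plainText = stringifyParts(exampleParts);

// The locale overrides are only recoverable from the parts themselves.
const partLocales = exampleParts
  .filter((p) => p.locale !== undefined)
  .map((p) => p.locale);
```

Once the message is flattened to a string, a text-to-speech consumer can no longer tell the three numbers apart, whereas the parts still carry `fr` and `fr-BE`.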

Member

Language and direction are needed for placeholders because they represent values being inserted into the overall string. The language (locale) is used to ensure proper rendering and processing (such as line breaking, text transforms, or spell-checking). The direction is used to enable bidi isolation and get the direction of the substring correct.

The language and direction of a formatted part might not match that of the message due to resource fallback when looking up the message. Or because values passed in have different language or direction. (And we want bidi isolation even if the directions match!!!!)

Providing the fields in the formatted part structure allows the user to easily access the values, e.g. it makes it easy to do something like this, resulting in proper isolation of the formatted parts (not shown is decorating the parts separately):

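var message = document.body; // stand-in: use whatever the host node is for the string
for (let part of formattedMessage.parts) {
    var span = document.createElement('span');
    // The proposed part field is `locale`; it maps to the HTML `lang` attribute.
    if (part.locale) span.lang = part.locale;
    if (part.dir) span.dir = part.dir;
    span.appendChild(document.createTextNode(part.value));
    message.appendChild(span);
}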

That is done on portions of the message. The whole message also has a language (locale) and base paragraph direction.
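
The same idea can be sketched without a DOM, which makes it easier to test. The following is only an illustration: `RenderablePart`, `escapeHtml()` and `partsToHtml()` are hypothetical names, and string values are assumed for every part.

```typescript
// Hypothetical part shape; every value is assumed to be a string here.
interface RenderablePart {
  type: string;
  value: string;
  locale?: string;
  dir?: 'ltr' | 'rtl' | 'auto';
}

// Minimal escaping for text and attribute values.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Wrap each part in its own <span>, copying locale and direction when present.
function partsToHtml(parts: RenderablePart[]): string {
  let out = '';
  for (const part of parts) {
    let attrs = '';
    if (part.locale !== undefined) attrs += ` lang="${escapeHtml(part.locale)}"`;
    if (part.dir !== undefined) attrs += ` dir="${part.dir}"`;
    out += `<span${attrs}>${escapeHtml(part.value)}</span>`;
  }
  return out;
}

const renderedHtml = partsToHtml([
  { type: 'text', value: 'Greeting: ' },
  // Hebrew "shalom", written with escapes to keep the source line left-to-right.
  { type: 'string', value: '\u05E9\u05DC\u05D5\u05DD', locale: 'he', dir: 'rtl' },
]);
```

Because each part gets its own element, the per-part `lang` and `dir` values stay attached to exactly the substring they describe.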

@mihnita
Collaborator

mihnita commented Nov 13, 2023

> It contains an Iterable of parts. The idea is that the formatting function (DateTimeRangeFormatter in this case) knows what parts there are and supplies that information.

Yes, but it looks like we are mixing the inputs with the outputs.
The input would be the pattern, with parameters, and pass that to MF2, which formats to parts.
MF2 already invoked DateTimeRangeFormatter, which already returned some kind of "sub-parts".

Now I need to be able to look at the output and access these parts and sub-parts without DateTimeRangeFormatter, which was input.

> Conversely, an implementation of MF2 could return an attributed string describing the parts defined here (using attributes instead of objects)?

Yes, but that implementation would not be standard.
Because there is no equivalent of MessageExpressionPart with a list of items in attributed strings.
All attributes are at the same level, there is no part with sub-parts.

> MF2 doesn't know or need to know about what kinds of parts (if any) a subordinate formatter has or returns.

Does not need to know, but if the formatters return parts that are "compatible" with what MF2 returns that is good.
First, because there is consistency. I don't iterate the MF2 parts in the result one way, and then the parts in a range format in another way.

If the "part id" is a string, then I can do something like CSS selection.

Example:

msg = "I'll be out of office {$vacation, :dateRange}!"

parts = "I'll be out of office November 27-December 7, 2023!"
attributedString "vacation" = "November 27-December 7, 2023"
attributedString "start" = range of "November 27"
attributedString "end" = range of "December 7"
attributedString "common" = range of  ", 2023"
attributedString "month" = range of  "November"
attributedString "month" = range of  "December"
attributedString "day" = range of  "27"
attributedString "day" = range of  "7"

The "attributes" overlap, same as bold / italic / other attributes overlap.
range of actually means the start offset + end offset.
But range of "November" is more readable than [22, 30]
All the APIs described above use that style.

But then I can iterate and I get something like this:

"I'll be out of office" => []
"November" => ["vacation", "start", "month"]
"27" => ["vacation", "start", "day"]
"-" => ["vacation", "separator"]
"December" => ["vacation", "end", "month"]
"7" => ["vacation", "end", "day"]
", " => ["vacation", "common"]
"2023" => ["vacation", "common", "year"]
"!" => []

And I can do the "css selection"-like operations:

  • select( parts, "vacation;month") => range of "November"
  • select( parts, "month") => collection of ranges (covering "November" and "December")

If we leave the sub-parts undefined it is more inconsistent in how I access the parts / subparts.
And the whole thing is a lot less useful.

exploration/formatted-parts.md (outdated)
interface MessageExpressionPart {
type: string;
source: string;
parts?: Iterable<{ type: string; value: unknown; source?: string }>;
Collaborator

Let's call this subparts, items, values, chunks, fragments... anything but parts, which we already use to describe the things returned by the formatToParts API.

type: string;
source: string;
parts?: Iterable<{ type: string; value: unknown; source?: string }>;
value?: unknown;
Collaborator

I think it would be more robust to describe the polymorphism of parts through generics rather than unknown. Something like the following:

interface MessageSingleValuePart<T> {
  type: string;
  source: string;
  value: T;
  dir?: "ltr" | "rtl" | "auto";
  locale?: string;
}

type MessageStringPart = MessageSingleValuePart<string>;

Collaborator Author

Where in the spec would you propose we would be able to make use of this polymorphism?

Collaborator

I mean to use it as a language for describing the concepts introduced by this doc; MessageStringPart is one example.

Collaborator Author

I added generic interfaces in line with your suggestion, but they're now referred to in the doc via comments, so that each part can retain its full definition in-place and a reader doesn't need to (but may) look up how they relate to each other. This also allows us to continue expressing how e.g. MessageFallbackPart explicitly does not have locale and dir fields, even if it does otherwise extend MessageSingleValuePart<"fallback", never>.
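
For readers following along, one possible reading of these generic interfaces is sketched below. This is an assumed reconstruction for illustration, not the text of the design doc: the exact field lists, the use of arrays instead of `Iterable`, and the `describePart()` helper are all hypothetical.

```typescript
// Assumed reconstruction; the design doc's actual definitions may differ.
interface MessageSingleValuePart<T extends string, V> {
  type: T;
  source: string;
  dir?: 'ltr' | 'rtl' | 'auto';
  locale?: string;
  value: V;
}

interface MessageMultiValuePart<T extends string, V> {
  type: T;
  source: string;
  dir?: 'ltr' | 'rtl' | 'auto';
  locale?: string;
  // An array is used here for simplicity; the thread discusses Iterable.
  parts: Array<{ type: string; value: V; source?: string }>;
}

// Written out in full: no value, no locale, no dir.
interface MessageFallbackPart {
  type: 'fallback';
  source: string;
}

type MessageStringPart = MessageSingleValuePart<'string', string>;
type MessageNumberPart = MessageMultiValuePart<'number', string>;
type AnyExamplePart = MessageStringPart | MessageNumberPart | MessageFallbackPart;

// The literal `type` field lets a consumer narrow to the right shape.
function describePart(part: AnyExamplePart): string {
  if (part.type === 'fallback') return `fallback from ${part.source}`;
  if (part.type === 'number') return `number with ${part.parts.length} sub-part(s)`;
  return `string "${part.value}"`;
}
```

With the type identifier as a type parameter, a consumer can branch on `type` and get a statically known `value` or `parts` shape in each branch.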

@mihnita
Collaborator

mihnita commented Nov 27, 2023

Sorry, I was on vacation last week.

> Would your concerns about the proposed API be satisfied if it could be shown how it could represent the above information you claim it cannot represent?

Sorry, but no.

I listed several concerns, so satisfying one is not enough.

Recap:

  1. overlapping ranges
  2. alternate representations on the same range (for example "27/12/2023" as text, and "December 27, 2023" as TTS)
  3. moving objects between server / client or across programming languages (for example PHP, or Go, calling a C++ layer implementation)
  4. "impedance" with the already established way the existing OS / tech stack does things

The "impedance" is a problem even if 1-3 can be represented with the above proposal, because the representation does not feel "natural" to a system that already uses attribute-style APIs.

Such a system would have several options:

  • implement the format to parts as this spec proposal says, which will not feel "natural" to such a system
  • implement both this proposal, and something that matches their style => duplicate effort for the implementer, and "noise" for the user ("Why two APIs? Which one should I use?")
  • implement only something that feels natural, not this proposal => then being accused of not implementing the full MF2 standard ("Yes, they have something, but it is not the standard, why can't they follow standards? Evil company X!")

@aphillips
Member

@mihnita Thanks for your comment. Do you have an alternate design approach to suggest?

@eemeli
Collaborator Author

eemeli commented Nov 27, 2023

> I listed several concerns, so satisfying one is not enough.

Addressing each below.

> 1. overlapping ranges

The example message you provided could be represented like this. If there is an example of a real-world message with improperly nesting ranges that you could point at, I'd be happy to show how such could be represented as well.

> 2. alternate representations on the same range (for example "27/12/2023" as text, and "December 27, 2023" as TTS)

Should the target not be an input for the date formatter? I mean, if an environment sometimes has its messages formatted for display on a screen and sometimes for TTS, would it not be reasonable to expect that this target is an input to the date formatter as well? If this distinction is made only after formatting, then I would presume that the MF2-internal date formatter that it's using is a custom one, which emits parts that include a Date object of some sort, along with a basket of formatting options. That's certainly doable with the proposed approach.

> 3. moving objects between server / client or across programming languages (for example PHP, or Go, calling a C++ layer implementation)

Could you clarify how this is a concern?

> 4. "impedance" with the already established way the existing OS / tech stack does things

In case you missed it earlier, I've here provided an example partsToRanges() function that can bridge this sort of impedance. Is that not sufficiently simple to solve the impedance issues?
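
For illustration, a conversion of this kind could look like the following sketch. It is not the function linked above: the `NestedPart` shape, the nesting depth, the `literal` and `text` type names and the example data are assumptions made here, modelled on the vacation example from earlier in the thread.

```typescript
// Hypothetical nested part: a leaf with a string value, or a container of sub-parts.
interface NestedPart {
  type: string;
  value?: string;
  parts?: NestedPart[];
}

interface AttributeRange {
  type: string;
  start: number;
  end: number;
}

// Flatten nested parts into one string plus attribute ranges (start inclusive, end exclusive).
function partsToRanges(parts: NestedPart[]): { text: string; ranges: AttributeRange[] } {
  let text = '';
  const ranges: AttributeRange[] = [];
  function visit(part: NestedPart): void {
    // Plain text and literals carry no attribute of their own.
    const tracked = part.type !== 'text' && part.type !== 'literal';
    const range: AttributeRange = { type: part.type, start: text.length, end: text.length };
    if (tracked) ranges.push(range); // pushed before children, so parents come first
    if (part.parts !== undefined) {
      for (const sub of part.parts) visit(sub);
    } else if (part.value !== undefined) {
      text += part.value;
    }
    range.end = text.length;
  }
  for (const part of parts) visit(part);
  return { text, ranges };
}

const vacationParts: NestedPart[] = [
  { type: 'text', value: "I'll be out of office " },
  {
    type: 'vacation',
    parts: [
      { type: 'start', parts: [
        { type: 'month', value: 'November' },
        { type: 'literal', value: ' ' },
        { type: 'day', value: '27' },
      ] },
      { type: 'separator', value: '-' },
      { type: 'end', parts: [
        { type: 'month', value: 'December' },
        { type: 'literal', value: ' ' },
        { type: 'day', value: '7' },
      ] },
      { type: 'common', parts: [
        { type: 'literal', value: ', ' },
        { type: 'year', value: '2023' },
      ] },
    ],
  },
  { type: 'text', value: '!' },
];

const flattened = partsToRanges(vacationParts);
const monthRanges = flattened.ranges.filter((r) => r.type === 'month');
```

Properly nested parts always map onto ranges this way; the reverse direction is only lossless when the ranges nest without partial overlap.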

@mihnita
Collaborator

mihnita commented Nov 27, 2023

> moving objects between server / client or across programming languages (for example PHP, or Go, calling a C++ layer implementation)
>
> Could you clarify how this is a concern?

I am not sure what's not clear here.

One use case is that formatToParts is executed in one environment (server, or C++ layer), and the result is passed to another layer (client, or JS / PHP / Go / Dart / whatever).
( I will use server / client for all cases, to simplify the discussion, but it is intended to cover everything)

If the result has objects (value?: unknown), and to get the string representation of that object means I need to invoke a "stringify" operation on it, this means that I have to somehow copy and convert that object between layers. Potentially with executable code in it.

Because to "stringify" an object I need to have access to it.
Then I need to either:

  • know its internals (so that I can write a toString(value) method)
  • or the object itself can stringify itself (has a value.toString()), which means I need to "move code" over the wire, or re-implement that code client side.

That is unnecessary with the attributed approach.
I can format something server side, and receive the string client side, and render and format it without knowing anything about the objects.

Example:
I call a server asking to format a message saying "Hello {$userId :personName}!" (might even be message id + userId).

If the server generates:

{
    "msg": "Hello John!",
    "attrs": [
        {"start": 6, "end": 10, "type": "userId:personName"}
    ]
}

then it is easy to send to the client, and I have all the info I need to render it, and format the name any way I want.
(all fields are known, and they are numbers and strings)
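
As a rough sketch of the client side of this exchange: the payload below is parsed with `JSON.parse`, and the attributed substring is recovered from the offsets alone. The `type` key and the `WireMessage` / `WireAttr` names are assumptions added for this illustration.

```typescript
// Hypothetical wire format: a plain string plus offset-based attributes.
interface WireAttr {
  start: number; // inclusive offset into msg
  end: number;   // exclusive offset into msg
  type: string;
}

interface WireMessage {
  msg: string;
  attrs: WireAttr[];
}

// What the server would send: only strings and numbers.
const wirePayload =
  '{"msg":"Hello John!","attrs":[{"start":6,"end":10,"type":"userId:personName"}]}';

// The client needs no knowledge of server-side types to decode it.
const received = JSON.parse(wirePayload) as WireMessage;

// Recover the text covered by each attribute.
const attributedSpans = received.attrs.map((attr) => ({
  type: attr.type,
  text: received.msg.slice(attr.start, attr.end),
}));
```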

If the server generates

"parts": [
    "text": "Hello John!"
    "placeholder" : {
        "type": "userId:personName"
        value: unknown
    }
]

then I don't know how to send value over the wire, or how to stringify it.
I might not even know the type of that object.

And if I somehow marshal it to the client then I have to maintain all that marshaling code (to keep in sync when server side implementation changes), breaks encapsulation (the client now has to know about all kind of new types), and makes dependencies more complex.

@mihnita
Collaborator

mihnita commented Nov 27, 2023

In one of the examples above we had

  { type: 'number', source: '|98|', locale: 'fr', parts: [{ type: 'integer', value: '98' }] },

What if the locale was Arabic?
There is no way to tell if I need to use native Arabic digits or ASCII, to convert integer to string.
And there is no way to know if the original had some flags (minFractionalDigits, and so on).

So technically MF.formatToParts didn't format anything; the "stringification" step will need to be fully i18n-aware. And we will not have enough info to stringify (the attributes of the placeholders are lost, and there is no function name).
The parts in this example are basically the parse tree with information removed.

@eemeli
Collaborator Author

eemeli commented Nov 27, 2023

> One use case is that formatToParts is executed in one environment (server, or C++ layer), and the result is passed to another layer (client, or JS / PHP / Go / Dart / whatever). ( I will use server / client for all cases, to simplify the discussion, but it is intended to cover everything)
>
> If the result has objects (value?: unknown), and to get the string representation of that object means I need to invoke a "stringify" operation on it, this means that I have to somehow copy and convert that object between layers. Potentially with executable code in it.

Another use case for MF2 is formatting in front-end environments that do not have such restrictions. For example, a seemingly simple message like

Click to continue: {$button}

could have its $button be a reference to a React component which must retain a connection to the same React instance with which it was created. In other words, it is an opaque value that is not stringifiable, and it is not transmissible over any boundaries.

To format that message, we cannot format it to a string because then we'd end up with [object Object]. So we format it to parts. And to allow that representation to represent this React component, the spec definition must include something like value?: unknown.

Would you agree that the above is a reasonable use case for MF2? If so, do you agree that this unfortunately leaves value?: unknown as the common denominator?

In the scenario you present, there are obviously further constraints that must be accounted for. These constraints may then be applied by the specific MF2 implementation, the functions you're using, or by checks on the post-processing for formatted-parts output. These restrictions should not limit what may be done by a different user of MF2.
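
One way to picture such a post-processing check is sketched below. It assumes that "safe to send over the wire" means every `value` is a string; `LoosePart` and `isWireSafe()` are hypothetical names, and arrays stand in for `Iterable`.

```typescript
// Hypothetical loose part shape, with values left as unknown.
interface LoosePart {
  type: string;
  source?: string;
  value?: unknown;
  parts?: Array<{ type: string; value: unknown }>;
}

// True when every value and sub-part value is a string, so the result
// can be passed to JSON.stringify without losing information.
function isWireSafe(parts: LoosePart[]): boolean {
  return parts.every((part) => {
    if (part.value !== undefined && typeof part.value !== 'string') return false;
    if (part.parts !== undefined) {
      return part.parts.every((sub) => typeof sub.value === 'string');
    }
    return true;
  });
}

// A number formatted to string sub-parts passes the check.
const serverSideResult: LoosePart[] = [
  { type: 'text', value: 'Total: ' },
  { type: 'number', source: '|98|', parts: [{ type: 'integer', value: '98' }] },
];

// An opaque object, standing in for a UI component reference, does not.
const frontEndResult: LoosePart[] = [
  { type: 'text', value: 'Click to continue: ' },
  { type: 'button', source: '$button', value: { opaque: true } },
];

const serverOk = isWireSafe(serverSideResult);
const frontEndOk = isWireSafe(frontEndResult);
```

A server-side deployment could run a check like this before serializing, while a front-end deployment would simply skip it.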

In one of the examples above we had

  { type: 'number', source: '|98|', locale: 'fr', parts: [{ type: 'integer', value: '98' }] },

What if the locale was Arabic?

Then the corresponding formatted-parts results would be:

{ type: 'number', source: '|98|', locale: 'ar', parts: [{ type: 'integer', value: '٩٨' }] }

The formatted parts are fully formatted parts.
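As a rough illustration of "fully formatted" sub-parts, here is a minimal sketch that assumes the number formatter delegates to the JS Intl.NumberFormat API. The NumberPart shape and the formatNumberPart helper are hypothetical names made up for this example, and the -u-nu-arab extension is used only to request Arabic-Indic digits explicitly rather than relying on locale defaults.

```typescript
// Hypothetical part shape, loosely following the examples in this thread.
interface NumberPart {
  type: 'number';
  source: string;
  locale: string;
  parts: Intl.NumberFormatPart[];
}

// Assumption: the sub-parts come straight from Intl.NumberFormat,
// so every sub-part value is already a locale-formatted string.
function formatNumberPart(source: string, value: number, locale: string): NumberPart {
  const nf = new Intl.NumberFormat(locale);
  return { type: 'number', source, locale, parts: nf.formatToParts(value) };
}

const frPart = formatNumberPart('|98|', 98, 'fr');
// Note: the locale field here carries the full tag, including the extension.
const arPart = formatNumberPart('|98|', 98, 'ar-u-nu-arab');
```

With this sketch, frPart.parts holds the integer sub-part with Latin digits and arPart.parts holds the same sub-part with Arabic-Indic digits, so a consumer never needs to re-format the number.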

@eemeli
Collaborator Author

eemeli commented Nov 29, 2023

@mihnita Are you satisfied by how each of your four concerns has been addressed?

@eemeli eemeli requested a review from stasm November 29, 2023 12:03
@mihnita
Collaborator

mihnita commented Nov 30, 2023

@mihnita Are you satisfied by how each of your four concerns has been addressed?

I don't think I see any of my concerns addressed.

The only change I see is "Add MessageSingleValuePart & MessageMultiValuePart definitions; other…"

Which splits the previous MessageExpressionPart (which had an Iterable) into
MessageSingleValuePart and MessageMultiValuePart (which has an Iterable, and kind of corresponds to the old MessageExpressionPart).

Then value: unknown changes to value: V, which is as unknown as before.
Has all the limitations I described.
I still don't know how to send it over the wire, and I still don't know how to stringify it.
Unless we say that one of the special cases is that unknown is in fact a string (or something that extends string, to be more precise)?

It does not solve the "one level tree" problem, unless we say that V can be a MessageMultiValuePart
Then the whole thing becomes a deeper tree.
Which would address the deeper nesting (placeholder => range of dates with start / end date, each with fields)


In one of the examples above we had
{ type: 'number', source: '|98|', locale: 'fr', parts: [{ type: 'integer', value: '98' }] },
What if the locale was Arabic?

Then the corresponding formatted-parts results would be:
{ type: 'number', source: '|98|', locale: 'ar', parts: [{ type: 'integer', value: '٩٨' }] }
The formatted parts are fully formatted parts.

So that means that in this case value is a string (or something that extends string).
Which is in fact what the parts returned by Intl.DateTimeFormat are, strings.

But in general "type" should not really be a type; it is semantic, no?
It would be something like "year" or "day" (which happen to be numbers).
So in the example above, "number" is really coming from the function that was used (:number), not from a type.

Then maybe we need a better name than type.
It does not describe the type of value (which in some programming languages is available at runtime anyway, with something like typeof).


You bring in React. And I get it.

Would you agree that the above is a reasonable use case for MF2? If so, do you agree that this unfortunately leaves value?: unknown as the common denominator?

Absolutely.


To summarize the current status, as I understand it:

  1. overlapping ranges

Because we have a bunch of unknown (ok V now), we can say "oh, but the value can be a MessageTextPart"
So one can imagine a placeholder that is a list formatter, where each element is a date range, each with start / end, and each with fields (day, month, year)
You get deeper nesting, a tree (because we "abuse" the value to be anything we want, so we put a full tree there)

  2. alternate representations on the same range (for example "27/12/2023" as text, and "December 27, 2023" as TTS)

Still not sure how to represent alternate streams of text (for screen rendering vs TTS)
Of course, by the magic of "unknown" we can put any object there. So you can say that I can put a map with alternate text values...

In general most things can be "solved" just because there are unknown types that can store anything.

But that way things get so flexible that there is no point in having this in the spec anymore.
If I can write a JS implementation, all in JS, following the spec 100%, and you can write a JS implementation, also all in JS, also following the spec 100%, but one can't swap the two implementations, it means the whole thing is kind of pointless.

  3. moving objects between server / client or across programming languages (for example PHP, or Go, calling a C++ layer implementation)

Still unclear. Except by saying "hey, the value V can be anything, even JSON or protobuf (or string)".
Again, the magic of "unknown"

  4. "impedance" with the already established way the existing OS / tech stack does things

There is no solution for this, except for MF2 format to parts to return an annotated string.

@mihnita
Collaborator

mihnita commented Nov 30, 2023

@mihnita Thanks for your comment. Do you have an alternate design approach to suggest?

Yes, an annotated string.
Which is conceptually what every existing framework seems to use.

And if this is a design document (as the title says), it should at least document that option, compare it with the current tree of unknown nodes, and explain why not to do what everything else does.
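For comparison purposes, one possible reading of the "annotated string" alternative can be sketched as a plain string plus a list of spans that point into it. This is only an assumption about what such a representation might look like; the FlatPart, Span and AnnotatedString names are hypothetical, and the sketch only works when every part value is already a string.

```typescript
// Hypothetical types for illustration; not taken from any spec or framework.
interface FlatPart { type: string; value: string }
interface Span { start: number; end: number; type: string }
interface AnnotatedString { text: string; spans: Span[] }

// Concatenate string-valued parts and record where each non-text part landed.
// Offsets are UTF-16 code unit indices, as with JS string indexing.
function toAnnotatedString(parts: FlatPart[]): AnnotatedString {
  let text = '';
  const spans: Span[] = [];
  for (const part of parts) {
    const start = text.length;
    text += part.value;
    if (part.type !== 'text') {
      spans.push({ start, end: text.length, type: part.type });
    }
  }
  return { text, spans };
}

const annotated = toAnnotatedString([
  { type: 'text', value: 'Total: ' },
  { type: 'number', value: '98' },
]);
```

Spans in this style may overlap or nest freely, which is the property the tree-of-parts design has to emulate through nested values.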

@mihnita
Collaborator

mihnita commented Nov 30, 2023

Still open: we don't need the MessageBiDiIsolationPart

Let me explain again, in a different way; maybe I can be clearer:

  1. how are these kinds of parts generated?

Somehow MF takes the parse tree, with input arguments, and generates parts.

And the MessageBiDiIsolationPart are:

a. not generated from a bidi control character typed by a translator (see Eemeli's answer in a thread above)

b. not from an explicit placeholder

Example:
"You have a message from {$user :person_name level=casual}"

Result, something like this:

[
  { type: "text", value: "You have a message from" }, // MessageTextPart
  { type: "person_name", source: "$user", dir: "rtl", locale: "ar", value: "فلانة" } // MessageSingleValuePart
]

We already have the info on the MessageSingleValuePart: it is Arabic, and its direction is rtl.

The only place where MessageBiDiIsolationPart might be handy is before / after the MessageSingleValuePart.
But we have all the info we need in the result above, without MessageBiDiIsolationPart.
That is enough info that when we take that result we can add real bidi control characters, or HTML attributes or elements.
That is what I do when I take the format-to-parts result and generate the DOM, or spanned text.
It is a kind of post-processing on the result, similar to changing case, or making a section bold.
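The post-processing described here could be sketched roughly as follows, assuming each part carries an optional dir field and a string-coercible value. The Part shape and the stringifyWithIsolation name are hypothetical; the isolate characters are the standard Unicode ones (LRI U+2066, RLI U+2067, FSI U+2068, PDI U+2069).

```typescript
// Hypothetical minimal part shape for this sketch.
interface Part {
  type: string;
  value?: unknown;
  dir?: 'ltr' | 'rtl' | 'auto';
  source?: string;
  locale?: string;
}

const LRI = '\u2066'; // LEFT-TO-RIGHT ISOLATE
const RLI = '\u2067'; // RIGHT-TO-LEFT ISOLATE
const FSI = '\u2068'; // FIRST STRONG ISOLATE
const PDI = '\u2069'; // POP DIRECTIONAL ISOLATE

// Wrap every part that declares a direction in the matching isolate pair;
// parts without a dir (such as literal text here) are emitted as-is.
function stringifyWithIsolation(parts: Part[]): string {
  let out = '';
  for (const part of parts) {
    const text = String(part.value ?? '');
    if (part.dir === undefined) {
      out += text;
    } else {
      const open = part.dir === 'ltr' ? LRI : part.dir === 'rtl' ? RLI : FSI;
      out += open + text + PDI;
    }
  }
  return out;
}

const isolated = stringifyWithIsolation([
  { type: 'text', value: 'You have a message from ' },
  { type: 'person_name', source: '$user', dir: 'rtl', locale: 'ar', value: 'فلانة' },
]);
```

A DOM-producing consumer could use the same dir metadata to emit a dir attribute on a wrapping element instead of control characters.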

@mihnita
Collaborator

mihnita commented Nov 30, 2023

making a section bold.

Speaking of bold: we have no parts for markdown.

@eemeli
Collaborator Author

eemeli commented Nov 30, 2023

You bring in React. And I get it.

Would you agree that the above is a reasonable use case for MF2? If so, do you agree that this unfortunately leaves value?: unknown as the common denominator?

Absolutely.

Good! So we agree that at the most generic level, a formatted part returned by an arbitrary MF2 implementation should have a shape conforming to MessageSingleValuePart<string, unknown>. The unknown value there corresponds to e.g. the Object what in Android Spannables.

And that's what this spec is specifying; the most generic possible definition within which all parts for all placeholders in all implementations must fit. So we need to leave a lot of space for the unknown, because we need to support all possible formatting targets (not just strings), while still explicitly defining what structure we can, and common fields such as source and dir.

Any implementation is of course allowed to be stricter than what's defined here in its outputs, and processors of formatted parts are allowed to specify that they require e.g. only string values, or assign other restrictions to what they're capable of dealing with. But these limitations are not universal, and do not apply to all cases where formatted parts are emitted or consumed.
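To make the "generic definition, stricter consumers" point concrete, here is a minimal sketch. The interface names follow the ones mentioned in this thread, but the exact fields are assumptions for illustration and may not match the design doc; isStringOnly is a hypothetical consumer-side check.

```typescript
// Assumed shapes; field lists are illustrative only.
interface MessageSingleValuePart<T extends string = string, V = unknown> {
  type: T;
  source: string;
  dir?: 'ltr' | 'rtl' | 'auto';
  locale?: string;
  value?: V;
}

interface MessageMultiValuePart<T extends string = string, V = unknown> {
  type: T;
  source: string;
  dir?: 'ltr' | 'rtl' | 'auto';
  locale?: string;
  parts: Iterable<{ type: string; value: V }>;
}

type MessagePart = MessageSingleValuePart | MessageMultiValuePart;
type StringPart =
  | MessageSingleValuePart<string, string>
  | MessageMultiValuePart<string, string>;

// A stricter consumer: accept only parts whose values are all strings,
// for example before sending them over the wire.
function isStringOnly(part: MessagePart): part is StringPart {
  if ('parts' in part) {
    return Array.from(part.parts).every((sub) => typeof sub.value === 'string');
  }
  return typeof part.value === 'string';
}

const opaqueButton = { label: 'stand-in for an opaque UI component' };
const allParts: MessagePart[] = [
  { type: 'number', source: '|98|', locale: 'fr', parts: [{ type: 'integer', value: '98' }] },
  { type: 'unknown', source: '$button', value: opaqueButton },
];
const transmissible = allParts.filter(isStringOnly);
```

Here the generic type admits both parts, while the consumer that needs serializable output keeps only the number part.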

In one of the examples above we had
{ type: 'number', source: '|98|', locale: 'fr', parts: [{ type: 'integer', value: '98' }] },
What if the locale was Arabic?

Then the corresponding formatted-parts results would be:
{ type: 'number', source: '|98|', locale: 'ar', parts: [{ type: 'integer', value: '٩٨' }] }
The formatted parts are fully formatted parts.

So that means that in this case value is a string (or something that extends string). Which is in fact what the parts returned by Intl.DateTimeFormat are, strings.

But in general "type" should not really be a type; it is semantic, no? It would be something like "year" or "day" (which happen to be numbers). So in the example above, "number" is really coming from the function that was used (:number), not from a type.

Then maybe we need a better name than type. It does not describe the type of value (which in some programming languages is available at runtime anyway, with something like typeof).

The type here is specifying the type of the part, not of the value. The current proposal matches the formatted-parts structure used by the JS Intl formatters, which indeed always have string values. And for example the JS implementation's 'number' formatted parts will always have string values, which is allowed because it's a subset of unknown.

  1. alternate representations on the same range (for example "27/12/2023" as text, and "December 27, 2023" as TTS)

Still not sure how to represent alternate streams of text (for screen rendering vs TTS) Of course, by the magic of "unknown" we can put any object there. So you can say that I can put a map with alternate text values...

In general most things can be "solved" just because there are unknown types that can store anything.

Yes. Because we must allow for arbitrary values like React components to show up as values, we must allow for the possibility that those values are not primitive, but contain some internal structure.

But that way things get so flexible that there is no point in having this in the spec anymore. If I can write a JS implementation, all in JS, following the spec 100%, and you can write a JS implementation, also all in JS, also following the spec 100%, but one can't swap the two implementations, it means the whole thing is kind of pointless.

This thing is not pointless. An emitter or consumer of formatted parts can be more specific than this spec in what it emits or consumes, allowing for interoperability.

It also has a rather important point in being a prerequisite for our markup definition.

@mihnita Thanks for your comment. Do you have an alternate design approach to suggest?

Yes, an annotated string. Which is conceptually what every existing framework seems to use.

And if this is a design document (as the title says), it should at least document that option, compare it with the current tree of unknown nodes, and explain why not to do what everything else does.

Could you expand that alternative from its current definition? Right now it just defines it as "Format to a string, but separately define metadata or other values." It seems like you think this should be expanded?

making a section bold.

Speaking of bold: we have no parts for markdown.

Presuming that you mean "markup", then you'll find those included in the proposed design of the doc linked to from #537. Unless you really do mean "markdown", in which case I don't know what you mean.

@mihnita
Collaborator

mihnita commented Nov 30, 2023

What about this: have both an unknown value, and a string one? (let's call it as_string for now)

It can be beneficial even for something like React.
For example:

msg = MessageFormat.fromString("Please click the {$button}")
parts = msg.formatToParts({ "button": rustButtonRef, ... })

Since {$button} is a placeholder, there is a function associated with it (explicit, {$button :react}, or implied from the type)
That function understands React objects. Potentially how to get the string representation from one.
The part it produces might contain the react ref as a value AND the string form.

That way you can post-process the result and fix grammar, or do other operations.
For example "... la [button]" adjusted to "... l'[button]"

The value can indeed be a Span (in Android), but the string value is not contained in a Span.
For example (real Android code):

SpannableString string = new SpannableString("Text with underline span");
string.setSpan(new UnderlineSpan(), 10, 19, Spanned.SPAN_EXCLUSIVE_EXCLUSIVE);

So the underline Span does not contain text.

It can also alleviate (maybe not completely?) the "alternate text" problem, where the alternate is sometimes not even text.
So a :dateformat function might return:
{ as_text: "١٩/٧/٢٠٢٣", value: new TtsSpan.DateBuilder().setDay(19).setMonth(7).setYear(2023) }

And a :number function might return:
{ as_text: "١٢٣.٤٥٦,٧٨٩", value: new TtsSpan.DecimalBuilder(123456.789, /*minimumFractionDigits*/ 0, /*maximumFractionDigits*/ 2) }
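A toy sketch of how a post-processor might use such a string field for the "la [button]" to "l'[button]" adjustment is below. Everything here is an assumption for illustration: the PartWithText shape, the as_string field name (taken from the suggestion above), and a deliberately naive elision rule that is far from real French grammar handling.

```typescript
// Hypothetical part shape carrying both an opaque value and a string form.
interface PartWithText {
  type: string;
  value: unknown;
  as_string?: string;
}

// Prefer the explicit string form; fall back to plain coercion.
function textOf(part: PartWithText): string {
  return part.as_string ?? String(part.value);
}

// Naive elision: "la " or "le " directly before a vowel becomes "l'".
function stringifyWithElision(parts: PartWithText[]): string {
  let out = '';
  for (const part of parts) {
    const text = textOf(part);
    if (/\b(la|le) $/i.test(out) && /^[aeiouéèê]/i.test(text)) {
      // Drop the article's vowel and the trailing space, then add the apostrophe.
      out = out.slice(0, -2) + "'";
    }
    out += text;
  }
  return out;
}

const iconRef = { note: 'stand-in for an opaque UI component' };
const withText = stringifyWithElision([
  { type: 'text', value: 'Cliquez sur la ' },
  { type: 'component', value: iconRef, as_string: 'icône' },
]);
const withoutText = stringifyWithElision([
  { type: 'text', value: 'Cliquez sur la ' },
  { type: 'component', value: iconRef },
]);
```

Without the string form the post-processor only sees the default coercion of the opaque value, so it has nothing to base the linguistic fix on.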


presuming that you mean "markup", then you'll find those included in the proposed design of the doc linked to from #537.

Thanks. That's what I expected. But I see, it is addressed in a different place.

Unless you really do mean "markdown", in which case I don't know what you mean.

Now you are nitpicking on what my brain produces at 3 am :-)

Yes, I meant markup, "Markdown is a lightweight markup language" (Wikipedia)

@eemeli
Collaborator Author

eemeli commented Nov 30, 2023

What about this: have both an unknown value, and a string one? (let's call it as_string for now)

This sounds like something that an implementation could choose to do, but I don't really see why it should be enforced in all cases for all users?

As far as I know, practically all reasonable programming languages allow for arbitrary objects to define a way for their own stringification when they're coerced to a string. In both JS and Java, that can be done by implementing a custom .toString() method.

What's the benefit of requiring all formatters to always perform such a stringification?

Since {$button} is a placeholder, there is a function associated with it (explicit, {$button :react}, or implied from the type) That function understands React objects. Potentially how to get the string representation from one. The part it produces might contain the react ref as a value AND the string form.

At least in the JS Intl implementation, a value like that without an explicitly recognised type will be formatted to a string by calling String(button), and formatted to parts as

{ type: 'unknown', source: '$button', value: button }

This way, arbitrary input values can be accepted and "formatted" without the core implementation needing to understand them, or being passed functions supporting their handling.
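A minimal sketch of that pass-through behaviour, assuming a fallback that neither inspects nor copies the input, might look like this. The UnknownPart shape mirrors the example above, while fallbackPart is a hypothetical helper name, not part of any existing API.

```typescript
// Shape of the fallback part shown above.
interface UnknownPart {
  type: 'unknown';
  source: string;
  value: unknown;
}

// The core formatter does not need to understand the value; it just wraps it.
function fallbackPart(source: string, value: unknown): UnknownPart {
  return { type: 'unknown', source, value };
}

const buttonValue = { label: 'stand-in for a UI component instance' };
const buttonPart = fallbackPart('$button', buttonValue);

// Format-to-string has to coerce, which loses the object entirely.
const coerced = String(buttonPart.value);
```

The part keeps the identical object reference, so a caller that knows what the value is can still render it, whereas the coerced string is just the default object tag.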

@mihnita
Collaborator

mihnita commented Dec 1, 2023

This sounds like something that an implementation could choose to do

But then it is not following the spec.

As far as I know, practically all reasonable programming languages allow for arbitrary objects to define a way for their own stringification

Do you know about C and C++? There is no standard way for an object to stringify itself.
Maybe they are not "reasonable programming languages" by your judgement, but they are still high ranking.
One of the ICU implementations is in C/C++, and the Windows APIs are C.
So that means that ICU4C and Windows are not reasonable, and we should deny them the right to implement this standard.

Second, even in programming languages that have standard ways for objects to stringify themselves, that is intended for debugging and for developer consumption, not for end users.
They are not necessarily readable, often intentionally so, and are locale independent.
So a React button with a label saying "Subscribe" will not have a toString implementation returning "Subscribe" in French.
Which is what you need to do linguistic fixes ("La [button]" to "L'[button]")

Third, saying that objects can stringify themselves ignores one of my requirements to be able to easily move "across borders" between programming languages, or server/client.

And if possible be consistent?
First (in #463 (comment)) you give an implementation of stringifyParts that requires objects to be stringifiable.
Then you give an example of React and you say "we cannot format it to a string because then we'd end up with [object Object]."
Then you say that objects that can stringify themselves are everywhere (except for React, I suppose?)
What is it?

I really tried to see if we can make this proposal work your way, with "a tree of nodes."
I didn't push too much for the attributed strings. At least not in the last few days of interactions.
But answering piecemeal, with one answer contradicting previous answers, or solving one issue but ignoring others, will not move the discussion forward.

Do you think that all 4 bullets listed #463 (comment) are reasonable?
4 can't be solved, with "tree of parts" not matching the attributed strings already used by iOS, Android, and ICU.
At least the other 1-3 are reasonable?

@eemeli
Collaborator Author

eemeli commented Dec 1, 2023

This sounds like something that an implementation could choose to do

But then it is not following the spec.

Yes, it would be. In case that was not implicitly clear (we're using TypeScript syntax, after all), I've added an explicit mention of the liberty to define additional fields.

As far as I know, practically all reasonable programming languages allow for arbitrary objects to define a way for their own stringification

Do you know about C and C++? There is no standard way for an object to stringify itself. Maybe they are not "reasonable programming languages" by your judgement, but they are still high ranking. One of the ICU implementations is in C/C++, and the Windows APIs are C. So that means that ICU4C and Windows are not reasonable, and we should deny them the right to implement this standard.

My most sincere apologies. I thought operator<< would be considered as the standard way for C++, and honestly didn't think of C users who did want to format "arbitrary object" values without a framework like GObject that does provide for a standard stringification interface.

Now, leaving the snark aside (which tbh I'd also appreciate from you), do you see how requiring an as_string property would mean that a bare C implementation would not be able to support truly arbitrary void * values as input arguments for a format-to-parts MF2 function? Adding such a required property on all parts severely restricts such an implementation.

Second, even in programming languages that have standard ways for objects to stringify themselves, that is intended for debugging and for developer consumption, not for end users. They are not necessarily readable, often intentionally so, and are locale independent. So a React button with a label saying "Subscribe" will not have a toString implementation returning "Subscribe" in French. Which is what you need to do linguistic fixes ("La [button]" to "L'[button]")

To reflect, we are here talking about formatting a message containing a placeholder {$button} (note, no annotation) in JavaScript, where the MF2 implementation will be powered by the ECMA-402 defined Intl.MessageFormat interface. Its current spec, and that proposed in the current design here, allow for input values of unknown type to pass through the parts formatter and come out the other side as

{ type: 'unknown', source: '$button', value: button }

where the button is the exact unmodified value that was passed as an argument to the formatToParts() method.

My understanding of what you're asking for is that even in this case, that formatted part should include something like as_string, yes? Please do correct me if I've misunderstood your position. Now, given that the core Intl.MF implementation is not going to have any React dependency hard-coded within it, any such as_string value will be generated by calling String(button), which will mean that it'll be "[object Object]".

Using such an implementation (ignoring any as_string fields, as they'll be useless), it's certainly possible for a user to take the formatted parts of a message, and if the user code can handle React components, to render the component part and extract localized text from the ultimately resolving HTML before feeding it into a system capable of applying linguistic fixes.

We are here defining the most specific and minimal data types that still enable something like Intl.MessageFormat to support the behaviour described above. Enforcing an as_string field won't help with any of it.

Third, saying that objects can stringify themselves ignores one of my requirements to be able to easily move "across borders" between programming languages, or server/client.

Wait, I thought we'd agreed that something like value: unknown is required to support arbitrary objects, which cannot be transmitted across realms like that? There is literally no representation that allows for their transmission as you require, but we still want to support them in formatted parts.

What you're asking for is possible with further restrictions on what's enabled by this proposal. It is perfectly valid for e.g. a server-side formatter to have stricter requirements for what it supports as input data than just "anything goes", and for it to make stricter guarantees about its outputs, as required for transmission across the wire, for instance. On the other hand, it is also perfectly valid for a formatter to not impose such restrictions on its inputs, and not to promise anything stricter than what's in this proposal.

If we were to restrict this proposal to the requirements of "formatted parts must be transmissible as data", we could not support React components. And you said this earlier:

Would you agree that the above is a reasonable use case for MF2? If so, do you agree that this unfortunately leaves value?: unknown as the common denominator?

Absolutely.

Could you now clarify what you meant by "Absolutely", because I took that to be agreement with my preceding statement?

And if possible be consistent? First (in #463 (comment)) you give an implementation of stringifyParts that requires objects to be stringifiable. Then you give an example of React and you say "we cannot format it to a string because then we'd end up with [object Object]." Then you say that objects that can stringify themselves are everywhere (except for React, I suppose?) What is it?

My statements above do not contradict each other. In JS, anything can be coerced to a string with String(); it's just that for most objects the resulting "[object Object]" is useless. In the comment that you link to, that stringifyParts() will indeed match exactly the formatToString() output. If the input contained e.g. a React component, it would end up in the string as "[object Object]" either way.

I really tried to see if we can make this proposal work your way, with "a tree of nodes." I didn't push too much for the attributed strings. At least not in the last few days of interactions. But answering piecemeal, with one answer contradicting previous answers, or solving one issue but ignoring others, will not move the discussion forward.

I do not think I will be able to convince you. I think you've decided that the approach proposed here isn't good, and you refuse to ever admit being wrong. I find discussions with you challenging, and they have been literally (not figuratively) the most draining part of my work for the past few years.

You refuse to discuss any one item at a time, pivoting to others when it seems like continuing on a track will lead to a conclusion you don't like, and then when returning you discard any previous developments not to your liking. Sure, this might be just me, but I much prefer having this dialogue written down, where it's recorded.

Do you think that all 4 bullets listed #463 (comment) are reasonable? 4 can't be solved, with "tree of parts" not matching the attributed strings already used by iOS, Android, and ICU. At least the other 1-3 are reasonable?

All your concerns are reasonable, but they should also be considered resolved by my preceding answers. Specifically, I've answered 1, 2, and 4 in #463 (comment) and 3 in #463 (comment). If those answers are not satisfactory, then please re-read them, and tell me specifically why or how they do not address the concerns you've previously expressed, or if they contain any parts that you do not understand. Our discussion since then seems to have been going around in circles.

Also, I reiterate my earlier request to expand on the attributed string alternative in this proposal, in case you find it lacking, and should you wish for it to be seriously considered as an alternative.

Collaborator

@stasm stasm left a comment


Observing the ongoing discussion, I'd like to propose to descope formatted parts from 2.0.

I think we should still recommend that implementations provide an API returning parts, but re-reading the doc, I don't think there's a compelling case for why to define such API in any more detail than what's currently presented in the "Requirements" section.

In fact, it occurs to me that our written requirements are the recommendation for implementers:

  • Define an iterable sequence of formatted part objects.
  • Include metadata for each part, such as type, source, direction, and locale.
  • Allow the representation of non-string values.
  • Allow the representation of values that consist of an iterable sequence of formatted parts.
  • Be able to represent each resolved value of a pattern with any number of formatted parts, including none.
  • [Removed the last one, because it's the only one that's not a recommendation to implementers.]

We struggle because we want to be super-agnostic on one side, and on the other, we want to specify something concrete so that compliant implementations can be swapped and can communicate with each other. However, I note that runtime fungibility and IPC are not listed in the use-cases of this design. Are we solving the right thing?

If runtime interoperability is not a goal, what do we actually gain from defining this optional agnostic interface as part of the spec?

@aphillips
Member

(chair hat on)

@eemeli @mihnita I observe that the discussion is starting to get personal. That's usually a sign that we should switch to something like a Slack huddle or at least take a breather.

I'll remind everyone on the thread to focus on technical arguments and not stray into speculation about motives, etc.


(chair hat OFF)

@stasm Thank you for your comment above. I think I observe the same thing.

I'll note that this is a design document and the "alternatives considered" section is extremely thin. The alternative @mihnita prefers has a single (somewhat biased) sentence in it.

In other words, this document is not really ready in light of our other designs, as it is more of a development container for a specific solution. I think the text is valuable. We should consider what our approach to format to parts should be, perhaps in Monday's call.

@aphillips aphillips merged commit 1d59585 into unicode-org:main Dec 4, 2023
1 check passed
@eemeli eemeli deleted the fmt-parts-design branch December 4, 2023 19:59
Development

Successfully merging this pull request may close these issues.

Decide on formatting to something other than text
5 participants